Data Errors

identify_errors

pywrangle.data_errors.identify_errors.identify_errors(df: dataframe, column: str, threshold: int = 65, show_progress: bool = False, limit: int = 5) → None

Prints potential data errors in the specified DataFrame column.

Parameters
  • df (dataframe) – DataFrame containing the column to check.

  • column (str) – Column in DataFrame to check.

  • threshold (int) – Rigor threshold, on a scale of 0 to 100, used to identify potential data errors. A higher threshold demands closer matches between strings. Defaults to 65.

  • show_progress (bool) – Prints matching progress to console. Defaults to False.

  • limit (int) – Maximum number of matches reported for each string. Higher values increase computation time and return more false positives. Defaults to 5.
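
To give a feel for how a threshold and a limit interact, the sketch below uses the standard library's difflib.get_close_matches, whose cutoff and n arguments play roughly analogous roles. This is only an analogy: difflib's ratio is not pywrangle's Similarity Index, and close_matches is a hypothetical helper, not part of pywrangle.

```python
from difflib import get_close_matches

states = ["california", "californi a", "californias", "cali fornia", "texas"]


def close_matches(word, candidates, threshold=65, limit=5):
    """Hypothetical helper: return at most `limit` candidates whose difflib
    ratio to `word` is at least `threshold` on a 0-100 scale."""
    # Exclude exact copies of the word itself; only near-misses are of interest.
    others = [c for c in candidates if c != word]
    # difflib expects a cutoff in [0, 1], so rescale the 0-100 threshold.
    return get_close_matches(word, others, n=limit, cutoff=threshold / 100)


# The three misspellings pass the default threshold; "texas" does not.
print(close_matches("california", states))
# A smaller limit caps how many matches are returned per string.
print(close_matches("california", states, limit=2))
# A very strict threshold filters out every near-miss.
print(close_matches("california", states, threshold=99))
```

Raising the threshold shrinks the result set, while raising the limit lets more (and weaker) candidates through, which is where the extra false positives come from.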

Notes

  • Data entry errors are identified based on a Similarity Index.

  • The Similarity Index is calculated using algorithms derived from Levenshtein distance and Double Metaphone.
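
As a rough illustration of the Levenshtein half of that calculation, the sketch below computes an edit distance and rescales it to a 0 to 100 score. It is a minimal sketch under stated assumptions, not pywrangle's implementation: the similarity function and its normalization are hypothetical, and the Double Metaphone (phonetic) component is omitted, so its scores will not match the Similarity Index values shown in the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions
    needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical 0-100 score: 100 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("california", "cali fornia"))  # one inserted space
print(round(similarity("california", "californias"), 2))
```

A single stray character in a long string keeps the score high, which is why near-duplicates such as 'californias' surface as likely data entry errors.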

Example

>>> df = create_df.create_str_df2()
>>> # Identify potential errors in the states column
>>> pw.identify_errors(df=df, column='states', threshold=70)
Record   |   String         |   Match          |   Similarity Index
------   |   ------------   |   ------------   |   ----------------
    1    |   california     |   californi as   |              92.75
    2    |   california     |   californi a    |               97.0
    3    |   california     |   californias    |              94.25
    4    |   california     |   cali fornia    |               96.0

converge_sim_vals

pywrangle.data_errors.converge_sim_vals.converge_sim_vals(df: DataFrame, column: str, values: Union[tuple, list], index: int) → DataFrame

Returns a DataFrame in which the similar values listed in values are ‘converged’ to (replaced by) the value at position index.

Parameters
  • df (DataFrame) – DataFrame to change.

  • column (str) – Column name.

  • values (Union[tuple, list]) – Similar values to converge, including the value to keep.

  • index (int) – Position in values of the value that all the other values converge to.

Notes

  • This function can be called after identifying errors with the identify_errors function.

Example

>>> df = create_df.create_str_df4()
>>> print(df)
        Index       States
    0      1    california
    1      2    california
    2      3   cali fornia
    3      4   californias
    4      5   californi a
    Index(['Index', 'States'], dtype='object')

>>> values = ['california', 'cali fornia', 'californias', 'californi a']
>>> df = pw.converge_sim_vals(df=df, column='States', values=values, index=0)
>>> print(df)
        Index      States
    0      1   california
    1      2   california
    2      3   california
    3      4   california
    4      5   california
    Index(['Index', 'States'], dtype='object')
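
For readers who want to see what this kind of convergence amounts to in plain pandas, here is a minimal sketch. The converge_values function is a hypothetical stand-in, not pywrangle's implementation; in particular, whether pywrangle copies the DataFrame or modifies it in place is an assumption made here for illustration.

```python
from typing import Sequence

import pandas as pd


def converge_values(df: pd.DataFrame, column: str,
                    values: Sequence[str], index: int) -> pd.DataFrame:
    """Hypothetical stand-in: replace every entry of `column` found in
    `values` with the single value at `values[index]`."""
    target = values[index]
    result = df.copy()                   # assumption: leave the input untouched
    mask = result[column].isin(values)   # rows holding any of the variants
    result.loc[mask, column] = target
    return result


df = pd.DataFrame({"States": ["california", "cali fornia", "californias", "texas"]})
values = ["california", "cali fornia", "californias"]
converged = converge_values(df, "States", values, index=0)
print(converged)
```

Rows whose value is not listed in values, such as 'texas' here, are left unchanged.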